Abstract

The $\ell_p$ regression problem takes as input a matrix $A \in \mathbb{R}^{n \times d}$, a vector $b \in \mathbb{R}^n$, and a number
$p \in [1, \infty)$, and it returns as output a number $Z$ and a vector $x_{\mathrm{OPT}} \in \mathbb{R}^d$ such that
$Z = \min_{x \in \mathbb{R}^d} \|Ax - b\|_p = \|Ax_{\mathrm{OPT}} - b\|_p$. In this paper, we construct coresets and obtain an efficient
two-stage sampling-based approximation algorithm for the very overconstrained ($n \gg d$) version of this
classical problem, for all $p \in [1, \infty)$. The first stage of our algorithm non-uniformly samples
$\hat{r}_1 = O(36^p d^{\max\{p/2+1,\,p\}+1})$ rows of $A$ and the corresponding elements of $b$, and then it solves the $\ell_p$
regression problem on the sample; we prove this is an 8-approximation. The second stage of our algorithm
uses the output of the first stage to resample $\hat{r}_1/\epsilon^2$ constraints, and then it solves the $\ell_p$ regression
problem on the new sample; we prove this is a $(1 + \epsilon)$-approximation. Our algorithm unifies, improves
upon, and extends the existing algorithms for special cases of $\ell_p$ regression, namely $p = 1, 2$ [11, 13].
In course of proving our result, we develop two concepts (well-conditioned bases and subspace-preserving
sampling) that are of independent interest.
1 Introduction

An important question in algorithmic problem solving is whether there exists a small subset of the input 
such that if computations are performed only on this subset, then the solution to the given problem can be 
approximated well. Such a subset is often known as a coreset for the problem. The concept of coresets has 
been extensively used in solving many problems in optimization and computational geometry; e.g., see the 
excellent survey by Agarwal, Har-Peled, and Varadarajan [2].

In this paper, we construct coresets and obtain efficient sampling algorithms for the classical $\ell_p$
regression problem, for all $p \in [1, \infty)$. Recall the $\ell_p$ regression problem:

Problem 1 ($\ell_p$ regression problem). Let $\|\cdot\|_p$ denote the $p$-norm of a vector. Given as input a matrix
$A \in \mathbb{R}^{n \times m}$, a target vector $b \in \mathbb{R}^n$, and a real number $p \in [1, \infty)$, find a vector $x_{\mathrm{OPT}}$ and a
number $Z$ such that

$Z = \min_{x \in \mathbb{R}^m} \|Ax - b\|_p = \|Ax_{\mathrm{OPT}} - b\|_p$. (1)

In this paper, we will use the following $\ell_p$ regression coreset concept:

Definition 2 ($\ell_p$ regression coreset). Let $0 < \epsilon < 1$. A coreset for Problem 1 is a set of indices $\mathcal{I}$ such
that the solution $\hat{x}_{\mathrm{OPT}}$ to $\min_{x \in \mathbb{R}^m} \|\hat{A}x - \hat{b}\|_p$, where $\hat{A}$ is composed of those rows of $A$ whose indices
are in $\mathcal{I}$ and $\hat{b}$ consists of the corresponding elements of $b$, satisfies
$\|A\hat{x}_{\mathrm{OPT}} - b\|_p \le (1 + \epsilon) \min_x \|Ax - b\|_p$.

If $n \gg m$, i.e., if there are many more constraints than variables, then (1) is an overconstrained $\ell_p$
regression problem. In this case, there does not in general exist a vector $x$ such that $Ax = b$, and thus
$Z > 0$. Overconstrained regression problems are fundamental in statistical data analysis and have numerous
applications in applied mathematics, data mining, and machine learning [16, 10]. Even though convex
programming methods can be used to solve the overconstrained regression problem in time $O((mn)^c)$, for
$c > 1$, this is prohibitive if $n$ is large.¹ This raises the natural question of developing more efficient
algorithms that run in time $O(m^c n)$, for $c > 1$, while possibly relaxing the solution to Equation (1). In
particular: Can we get a $\kappa$-approximation to the $\ell_p$ regression problem, i.e., a vector $\hat{x}$ such that
$\|A\hat{x} - b\|_p \le \kappa Z$, where $\kappa \ge 1$? Note that a coreset of small size would strongly satisfy our requirements
and result in an efficiently computed solution that is almost as good as the optimal. Thus, the question
becomes: Do coresets exist for the $\ell_p$ regression problem, and if so can we compute them efficiently?

Our main result is an efficient two-stage sampling-based approximation algorithm that constructs a coreset
and thus achieves a $(1 + \epsilon)$-approximation for the $\ell_p$ regression problem. The first stage of the algorithm
is sufficient to obtain a (fixed) constant factor approximation. The second stage of the algorithm carefully
uses the output of the first stage to construct a coreset and achieve arbitrary constant factor approximation.
1.1 Our contributions 

Summary of results. For simplicity of presentation, we summarize the results for the case of $m = d =
\mathrm{rank}(A)$. Let $k = \max\{p/2 + 1, p\}$ and let $\phi(r, d)$ be the time required to solve an $\ell_p$ regression problem
with $r$ constraints and $d$ variables. In the first stage of the algorithm, we compute a set of sampling
probabilities $p_1, \ldots, p_n$ in time $O(nd^5 \log n)$, sample $\hat{r}_1 = O(36^p d^{k+1})$ rows of $A$ and the corresponding
elements of $b$ according to the $p_i$'s, and solve an $\ell_p$ regression problem on the (much smaller) sample; we
prove this is an 8-approximation algorithm with a running time of $O(nd^5 \log n + \phi(\hat{r}_1, d))$. In the second
stage of the algorithm, we use the residual from the first stage to compute a new set of sampling
probabilities $q_1, \ldots, q_n$, sample additional $\hat{r}_2 = O(\hat{r}_1/\epsilon^2)$ rows of $A$ and the corresponding elements of $b$
according to the $q_i$'s, and solve an $\ell_p$ regression problem on the (much smaller) sample; we prove this is a
$(1 + \epsilon)$-approximation algorithm with a total running time of $O(nd^5 \log n + \phi(\hat{r}_2, d))$ (Section 4). We also
show how to extend our basic algorithm to commonly encountered and more general settings of constrained,
generalized, and weighted $\ell_p$ regression problems (Section 5).

¹ For the special case of $p = 2$, vector space methods can solve the regression problem in time $O(m^2 n)$, and
if $p = 1$ linear programming methods can be used.

We note that the $\ell_p$ regression problem for $p = 1, 2$ has been studied before. For $p = 1$, Clarkson [11]
uses a subgradient-based algorithm to preprocess $A$ and $b$ and then samples the rows of the modified problem;
these elegant techniques however depend crucially on the linear structure of the $\ell_1$ regression problem.²
Furthermore, this algorithm does not yield coresets. For $p = 2$, Drineas, Mahoney, and Muthukrishnan [13]
construct coresets by exploiting the singular value decomposition, a property peculiar to the $\ell_2$ space. Thus
in order to efficiently compute coresets for the $\ell_p$ regression problem for all $p \in [1, \infty)$, we need tools
that capture the geometry of $\ell_p$ norms. In this paper we develop the following two tools that may be of
independent interest (Section 3).

(1) Well-conditioned bases. Informally speaking, if $U$ is a well-conditioned basis, then for all $z \in \mathbb{R}^d$,
$\|z\|_p$ should be close to $\|Uz\|_p$. We will formalize this by requiring that for all $z \in \mathbb{R}^d$, $\|z\|_q$
multiplicatively approximates $\|Uz\|_p$ by a factor that can depend on $d$ but is independent of $n$ (where $p$ and
$q$ are conjugate; i.e., $q = p/(p - 1)$). We show that these bases exist and can be constructed in time
$O(nd^5 \log n)$. In fact, our notion of a well-conditioned basis can be interpreted as a computational analog of
the Auerbach and Lewis bases studied in functional analysis [25]. They are also related to the barycentric
spanners recently introduced by Awerbuch and R. Kleinberg [5] (Section 3.1). J. Kleinberg and Sandler [17]
defined the notion of an $\ell_1$-independent basis, and our well-conditioned basis can be used to obtain an
exponentially better "condition number" than their construction. Further, Clarkson [11] defined the notion
of an "$\ell_1$-conditioned matrix," and he preprocessed the input matrix to an $\ell_1$ regression problem so that it
satisfies conditions similar to those satisfied by our bases.

(2) Subspace-preserving sampling. We show that sampling rows of $A$ according to information in the rows of
a well-conditioned basis of $A$ minimizes the sampling variance and consequently, the rank of $A$ is not lost by
sampling. This is critical for our relative-error approximation guarantees. The notion of subspace-preserving
sampling was used in [13] for $p = 2$, but we abstract and generalize this concept for all $p \in [1, \infty)$.

We note that for $p = 2$, our sampling complexity matches that of [13], which is $O(d^2/\epsilon^2)$; and for $p = 1$,
it improves that of [11] from $O(d^{3.5}(\log d)/\epsilon^2)$ to $O(d^{2.5}/\epsilon^2)$.

Overview of our methods. Given an input matrix $A$, we first construct a well-conditioned basis for $A$ and
use that to obtain bounds on a slightly non-standard notion of a $p$-norm condition number of a matrix. The
use of this particular condition number is crucial since the variance in the subspace-preserving sampling
can be upper bounded in terms of it. An $\epsilon$-net argument then shows that the first-stage sampling gives us an
8-approximation. The next twist is to use the output of the first stage as feedback to fine-tune the sampling
probabilities. This is done so that the "positional information" of $b$ with respect to $A$ is also preserved in
addition to the subspace. A more careful use of a different $\epsilon$-net shows that the second-stage sampling
achieves a $(1 + \epsilon)$-approximation.

1.2 Related work 

As mentioned earlier, in course of providing a sampling-based approximation algorithm for $\ell_1$ regression,
Clarkson [11] shows that coresets exist and can be computed efficiently for a controlled $\ell_1$ regression
problem. Clarkson first preprocesses the input matrix $A$ to make it well-conditioned with respect to the $\ell_1$
norm, then applies a subgradient-descent-based approximation algorithm to guarantee that the $\ell_1$ norm of
the target vector is conveniently bounded. Coresets of size $O(d^{3.5} \log d/\epsilon^2)$ are thereupon exhibited for
this modified regression problem. For the $\ell_2$ case, Drineas, Mahoney and Muthukrishnan [13] designed
sampling strategies to preserve the subspace information of $A$ and proved the existence of a coreset of rows
of size $O(d^2/\epsilon^2)$ for the original $\ell_2$ regression problem; this leads to a $(1 + \epsilon)$-approximation algorithm.
While their algorithm used $O(nd^2)$ time to construct the coreset and solve the $\ell_2$ regression problem (which
is sufficient time to solve the regression problem exactly), in a subsequent work, Sarlós [19] improved the
running time for solving the regression problem to $O(nd)$ by using random sketches based on the Fast
Johnson-Lindenstrauss transform of Ailon and Chazelle [3].

² Two ingredients of [11] use the linear structure: the subgradient-based preprocessing itself, and the
counting argument for the concentration bound.

More generally, embedding $d$-dimensional subspaces of $L_p$ into $\ell_p^{f(d)}$ using coordinate restrictions has
been extensively studied [20, 8, 22, 23, 21]. Using well-conditioned bases, one can provide a constructive
analog of Schechtman's existential $L_1$ embedding result [20] (see also [8]), that any $d$-dimensional subspace
of $L_1[0, 1]$ can be embedded in $\ell_1^r$ with distortion $(1 + \epsilon)$ with $r = O(d^2/\epsilon^2)$, albeit with an extra factor of
$\sqrt{d}$ in the sampling complexity. Coresets have been analyzed by the computational geometry community as
a tool for efficiently approximating various extent measures [1, 2]; see also [15, 6, 14] for applications of
coresets in combinatorial optimization. An important difference is that most of the coreset constructions are
exponential in the dimension, and thus applicable only to low-dimensional problems, whereas our coresets
are polynomial in the dimension, and thus applicable to high-dimensional problems.

2 Preliminaries 

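Given a vector $x \in \mathbb{R}^m$, its $p$-norm is $\|x\|_p = \left(\sum_{i=1}^m |x_i|^p\right)^{1/p}$, and the dual norm of $\|\cdot\|_p$ is denoted
$\|\cdot\|_q$, where $1/p + 1/q = 1$. Given a matrix $A \in \mathbb{R}^{n \times m}$, its generalized $p$-norm is
$|||A|||_p = \left(\sum_{i=1}^n \sum_{j=1}^m |A_{ij}|^p\right)^{1/p}$. This is a submultiplicative matrix norm that generalizes the Frobenius
norm from $p = 2$ to all $p \in [1, \infty)$, but it is not a vector-induced matrix norm. The $j$-th column of $A$ is
denoted $A_{\star j}$, and the $i$-th row is denoted $A_{i\star}$. In this notation,
$|||A|||_p = \left(\sum_j \|A_{\star j}\|_p^p\right)^{1/p} = \left(\sum_i \|A_{i\star}\|_p^p\right)^{1/p}$. For $x, x', x'' \in \mathbb{R}^m$, it can be shown using Hölder's
inequality that $\|x - x'\|_p^p \le 2^{p-1}\left(\|x - x''\|_p^p + \|x'' - x'\|_p^p\right)$.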
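As a small numerical illustration of these definitions, the following Python sketch computes the vector
$p$-norm and the generalized matrix $p$-norm and checks the last inequality on one example. The helper names
(`p_norm`, `generalized_p_norm`, `relaxed_triangle_holds`) and the test vectors are ours, chosen only for
illustration.

```python
def p_norm(x, p):
    """Vector p-norm (sum_i |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def generalized_p_norm(A, p):
    """Entrywise matrix p-norm |||A|||_p; A is given as a list of rows."""
    return sum(abs(a) ** p for row in A for a in row) ** (1.0 / p)


def relaxed_triangle_holds(x, x1, x2, p):
    """Check ||x - x1||_p^p <= 2^(p-1) (||x - x2||_p^p + ||x2 - x1||_p^p)."""
    diff = lambda u, v: [a - b for a, b in zip(u, v)]
    lhs = p_norm(diff(x, x1), p) ** p
    rhs = 2 ** (p - 1) * (p_norm(diff(x, x2), p) ** p + p_norm(diff(x2, x1), p) ** p)
    return lhs <= rhs + 1e-12  # small slack for floating-point rounding


A = [[3.0, 0.0], [0.0, 4.0]]
frobenius = generalized_p_norm(A, 2)  # coincides with the Frobenius norm when p = 2
ok = relaxed_triangle_holds([1.0, 2.0], [-1.0, 0.5], [0.0, 3.0], 3)
```

For $p = 2$ the generalized norm of the diagonal matrix above is the Frobenius norm, and the relaxed triangle
inequality is checked here for $p = 3$.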

Two crucial ingredients in our proofs are $\epsilon$-nets and tail inequalities. A subset $\mathcal{N}(D)$ of a set $D$ is called
an $\epsilon$-net in $D$ for some $\epsilon > 0$ if for every $x \in D$, there is a $y \in \mathcal{N}(D)$ with $\|x - y\| \le \epsilon$. In order to
construct an $\epsilon$-net for $D$ it is enough to choose $\mathcal{N}(D)$ to be the maximal set of points that are pairwise $\epsilon$
apart. It is well known that the unit ball of a $d$-dimensional space has an $\epsilon$-net of size at most $(3/\epsilon)^d$ [8].
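The maximal-separated-set construction can be sketched greedily in Python. This is only an illustration on
a finite stand-in for $D$ (a grid inside the Euclidean unit disk, which is our assumption); the function
names are hypothetical.

```python
def greedy_eps_net(points, eps, dist):
    """Return a maximal subset of `points` whose members are pairwise more than eps apart."""
    net = []
    for x in points:
        # Keep x only if it is far from every point already chosen.
        if all(dist(x, y) > eps for y in net):
            net.append(x)
    return net


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# A finite grid inside the unit disk, standing in for the set D.
grid = [(i / 10.0, j / 10.0) for i in range(-10, 11) for j in range(-10, 11)]
D = [pt for pt in grid if euclidean(pt, (0.0, 0.0)) <= 1.0]

net = greedy_eps_net(D, 0.5, euclidean)
# Maximality implies coverage: every point of D is within eps of some net point.
covered = all(any(euclidean(x, y) <= 0.5 for y in net) for x in D)
```

Any point rejected by the greedy pass was within `eps` of an earlier net point, which is why maximality
gives the covering property; here $d = 2$ and $\epsilon = 0.5$, so the net has at most $(3/\epsilon)^d = 36$ points.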

Finally, throughout this paper, we will use the following sampling matrix formalism to represent our
sampling operations. Given a set of $n$ probabilities, $p_i \in (0, 1]$, for $i = 1, \ldots, n$, let $S$ be an $n \times n$
diagonal sampling matrix such that $S_{ii}$ is set to $1/p_i^{1/p}$ with probability $p_i$ and to zero otherwise. Clearly,
premultiplying $A$ or $b$ by $S$ determines whether the $i$-th row of $A$ and the corresponding element of $b$ will
be included in the sample, and the expected number of rows/elements selected is $r' = \sum_{i=1}^n p_i$. (In what
follows, we will abuse notation slightly by ignoring zeroed-out rows and regarding $S$ as an $r' \times n$ matrix and
thus $SA$ as an $r' \times m$ matrix.) Thus, e.g., sampling constraints from Equation (1) and solving the induced
subproblem may be represented as solving

$\hat{Z} = \min_{\hat{x} \in \mathbb{R}^m} \|SA\hat{x} - Sb\|_p$. (2)

A vector $\hat{x}$ is said to be a $\kappa$-approximation to the $\ell_p$ regression problem of Equation (1), for $\kappa \ge 1$, if
$\|A\hat{x} - b\|_p \le \kappa Z$. Finally, the Appendix contains all the missing proofs.
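The sampling matrix formalism above can be sketched directly in Python. The function below is a minimal
illustration, not the paper's implementation: it keeps row $i$ with probability $p_i$, rescales it by
$1/p_i^{1/p}$, and drops the zeroed-out rows. The name `sample_rows`, the seed, and the toy data are our
assumptions.

```python
import random


def sample_rows(A, b, probs, p, rng):
    """Apply a random diagonal sampling matrix S to A and b.

    Row i is kept with probability probs[i] and rescaled by 1 / probs[i]**(1/p);
    zeroed-out rows are dropped, so the result has r' = sum(probs) rows in expectation.
    """
    SA, Sb = [], []
    for row, target, prob in zip(A, b, probs):
        if rng.random() < prob:
            scale = 1.0 / prob ** (1.0 / p)
            SA.append([scale * a for a in row])
            Sb.append(scale * target)
    return SA, Sb


rng = random.Random(0)  # fixed seed so the sketch is reproducible
A = [[1.0, float(i)] for i in range(1000)]
b = [2.0 * i for i in range(1000)]
probs = [0.1] * 1000
SA, Sb = sample_rows(A, b, probs, p=1, rng=rng)
expected_rows = sum(probs)  # r' = sum_i p_i
```

With this rescaling, each coordinate contributes $p_i \cdot |(Ax)_i|^p / p_i$ in expectation, so
$\|SAx\|_p^p$ equals $\|Ax\|_p^p$ in expectation for every fixed $x$.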

3 Main technical ingredients 
3.1 Well-conditioned bases 

We introduce the following notion of a "well-conditioned" basis. 
Definition 3 (Well-conditioned basis). Let $A$ be an $n \times m$ matrix of rank $d$, let $p \in [1, \infty)$, and let $q$ be its
dual norm. Then an $n \times d$ matrix $U$ is an $(\alpha, \beta, p)$-well-conditioned basis for the column space of $A$ if
(1) $|||U|||_p \le \alpha$, and (2) for all $z \in \mathbb{R}^d$, $\|z\|_q \le \beta \|Uz\|_p$. We will say that $U$ is a $p$-well-conditioned basis
for the column space of $A$ if $\alpha$ and $\beta$ are $d^{O(1)}$, independent of $m$ and $n$.

Recall that any orthonormal basis $U$ for $\mathrm{span}(A)$ satisfies both $|||U|||_2 = \|U\|_F = \sqrt{d}$ and also
$\|z\|_2 = \|Uz\|_2$ for all $z \in \mathbb{R}^d$, and thus is a $(\sqrt{d}, 1, 2)$-well-conditioned basis. Thus, Definition 3
generalizes to an arbitrary $p$-norm, for $p \in [1, \infty)$, the notion that an orthogonal matrix is well-conditioned
with respect to the 2-norm. Note also that duality is incorporated into Definition 3 since it relates the
$q$-norm of the vector $z \in \mathbb{R}^d$ to the $p$-norm of the vector $Uz \in \mathbb{R}^n$, where $p$ and $q$ are dual.³
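The $p = 2$ case can be checked numerically. The sketch below uses a hand-picked $4 \times 2$ matrix with
orthonormal columns (our choice, purely illustrative) and verifies the two conditions of Definition 3 with
$\alpha = \sqrt{d}$ and $\beta = 1$.

```python
import math


def p_norm(x, p):
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def generalized_p_norm(U, p):
    return sum(abs(u) ** p for row in U for u in row) ** (1.0 / p)


def matvec(U, z):
    return [sum(u * zj for u, zj in zip(row, z)) for row in U]


# A 4 x 2 matrix with orthonormal columns (n = 4, d = 2).
h = 0.5
U = [[h, h], [h, -h], [h, h], [h, -h]]
d = 2

alpha = generalized_p_norm(U, 2)                      # condition (1): |||U|||_2 <= alpha
z = [3.0, -4.0]
beta_ratio = p_norm(z, 2) / p_norm(matvec(U, z), 2)   # condition (2) with q = p = 2
```

Here `alpha` evaluates to $\sqrt{2}$ and the ratio $\|z\|_2 / \|Uz\|_2$ is 1, matching the
$(\sqrt{d}, 1, 2)$ parameters stated above.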

The existence and efficient construction of these bases is given by the following. 

Theorem 4. Let $A$ be an $n \times m$ matrix of rank $d$, let $p \in [1, \infty)$, and let $q$ be its dual norm. Then there
exists an $(\alpha, \beta, p)$-well-conditioned basis $U$ for the column space of $A$ such that: if $p < 2$, then
$\alpha = d^{1/p + 1/2}$ and $\beta = 1$; if $p = 2$, then $\alpha = d^{1/2}$ and $\beta = 1$; and if $p > 2$, then $\alpha = d^{1/p + 1/2}$ and
$\beta = d^{1/q - 1/2}$. Moreover, $U$ can be computed in $O(nmd + nd^5 \log n)$ time (or in just $O(nmd)$ time if $p = 2$).

Proof. Let $A = QR$, where $Q$ is any $n \times d$ matrix that is an orthonormal basis for $\mathrm{span}(A)$ and $R$ is
a $d \times m$ matrix. If $p = 2$, then $Q$ is the desired basis $U$; from the discussion following Definition 3,
$\alpha = \sqrt{d}$ and $\beta = 1$, and computing it requires $O(nmd)$ time. Otherwise, fix $Q$ and $p$ and define the
norm $\|z\|_{Q,p} \triangleq \|Qz\|_p$. A quick check shows that $\|\cdot\|_{Q,p}$ is indeed a norm. ($\|z\|_{Q,p} = 0$ if and only if
$z = 0$ since $Q$ has full column rank; $\|\gamma z\|_{Q,p} = \|\gamma Qz\|_p = |\gamma| \|Qz\|_p = |\gamma| \|z\|_{Q,p}$; and
$\|z + z'\|_{Q,p} = \|Q(z + z')\|_p \le \|Qz\|_p + \|Qz'\|_p = \|z\|_{Q,p} + \|z'\|_{Q,p}$.)

Consider the set $C = \{z \in \mathbb{R}^d : \|z\|_{Q,p} \le 1\}$, which is the unit ball of the norm $\|\cdot\|_{Q,p}$. In addition,
define the $d \times d$ matrix $F$ such that $\mathcal{E}_{LJ} = \{z \in \mathbb{R}^d : z^T F z \le 1\}$ is the Löwner-John ellipsoid of $C$. Since
$C$ is symmetric about the origin, $(1/\sqrt{d})\,\mathcal{E}_{LJ} \subseteq C \subseteq \mathcal{E}_{LJ}$; thus, for all $z \in \mathbb{R}^d$,

$\|z\|_{LJ} \le \|z\|_{Q,p} \le \sqrt{d}\,\|z\|_{LJ}$, (3)

where $\|z\|_{LJ}^2 = z^T F z$ (see, e.g., [pp. 413-4]). Since the matrix $F$ is symmetric positive definite, we can
express it as $F = G^T G$, where $G$ is full rank and upper triangular. Since $Q$ is an orthogonal basis for
$\mathrm{span}(A)$ and $G$ is a $d \times d$ matrix of full rank, it follows that $U = QG^{-1}$ is an $n \times d$ matrix that spans the
column space of $A$. We claim that $U = QG^{-1}$ is the desired $p$-well-conditioned basis.

To establish this claim, let $z' = Gz$. Thus, $\|z\|_{LJ}^2 = z^T F z = z^T G^T G z = (Gz)^T Gz = z'^T z' = \|z'\|_2^2$.
Furthermore, since $G$ is invertible, $z = G^{-1}z'$, and thus $\|z\|_{Q,p} = \|Qz\|_p = \|QG^{-1}z'\|_p$. By combining
these expressions with (3), it follows that for all $z' \in \mathbb{R}^d$,

$\|z'\|_2 \le \|Uz'\|_p \le \sqrt{d}\,\|z'\|_2$. (4)

Since $|||U|||_p^p = \sum_j \|U_{\star j}\|_p^p = \sum_j \|Ue_j\|_p^p \le \sum_j d^{p/2} \|e_j\|_2^p = d^{p/2+1}$, where the inequality follows from
the upper bound in (4), it follows that $\alpha = d^{1/p + 1/2}$. If $p < 2$, then $q > 2$ and $\|z\|_q \le \|z\|_2$ for all
$z \in \mathbb{R}^d$; by combining this with (4), it follows that $\beta = 1$. On the other hand, if $p > 2$, then $q < 2$ and
$\|z\|_q \le d^{1/q - 1/2} \|z\|_2$; by combining this with (4), it follows that $\beta = d^{1/q - 1/2}$.

³ For $p = 2$, Drineas, Mahoney, and Muthukrishnan used this basis, i.e., an orthonormal matrix, to construct
probabilities to sample the original matrix. For $p = 1$, Clarkson used a procedure similar to the one we
describe in the proof of Theorem 4 to preprocess $A$ such that the 1-norm of $z$ is a $d\sqrt{d}$ factor away from
the 1-norm of $Az$.
In order to construct $U$, we need to compute $Q$ and $G$ and then invert $G$. Our matrix $A$ can be decomposed
into $QR$ using the compact QR decomposition in $O(nmd)$ time. The matrix $F$ describing the Löwner-John
ellipsoid of the unit ball of $\|\cdot\|_{Q,p}$ can be computed in $O(nd^5 \log n)$ time. Finally, computing $G$ from $F$
takes $O(d^3)$ time, and inverting $G$ takes $O(d^3)$ time. □

Connection to barycentric spanners. A point set $K = \{K_1, \ldots, K_d\} \subseteq D \subseteq \mathbb{R}^d$ is a barycentric spanner
for the set $D$ if every $z \in D$ may be expressed as a linear combination of elements of $K$ using coefficients in
$[-C, C]$, for $C = 1$. When $C > 1$, $K$ is called a $C$-approximate barycentric spanner. Barycentric spanners
were introduced by Awerbuch and R. Kleinberg in [5]. They showed that if a set is compact, then it has
a barycentric spanner. Our proof shows that if $A$ is an $n \times d$ matrix of rank $d$ and
$\tau^{-1} = R^{-1}G^{-1} \in \mathbb{R}^{d \times d}$, then the columns of $\tau^{-1}/\sqrt{d}$ form a $\sqrt{d}$-approximate barycentric spanner for
$D = \{z \in \mathbb{R}^d : \|Az\|_p \le 1\}$. To see this, first note that each $\tau^{-1}_{\star j}/\sqrt{d}$ belongs to $D$ since
$\|A\tau^{-1}_{\star j}\|_p = \|Ue_j\|_p \le \sqrt{d}\,\|e_j\|_2 = \sqrt{d}$, where the inequality is the upper bound in Equation (4).
Moreover, since $\tau^{-1}$ spans $\mathbb{R}^d$, we can write any $z \in D$ as $z = \tau^{-1}\nu$. Hence,

$\|\nu\|_\infty \le \|\nu\|_2 \le \|U\nu\|_p = \|A\tau^{-1}\nu\|_p = \|Az\|_p \le 1$,

where the second inequality is the lower bound in Equation (4). Thus
$z = \sum_j (\sqrt{d}\,\nu_j)(\tau^{-1}_{\star j}/\sqrt{d})$ with every coefficient in $[-\sqrt{d}, \sqrt{d}]$. This shows that our basis has the added
property that every element $z \in D$ can be expressed as a linear combination of the columns of
$\tau^{-1}/\sqrt{d}$ using coefficients whose $\ell_2$ norm is bounded by $\sqrt{d}$.

Connection to Auerbach bases. An Auerbach basis $U = \{U_{\star j}\}_{j=1}^d$ for a $d$-dimensional normed space
$\mathcal{A}$ is a basis such that $\|U_{\star j}\|_p = 1$ for all $j$ and such that whenever $y = \sum_j \nu_j U_{\star j}$ is in the unit ball of $\mathcal{A}$
then $|\nu_j| \le 1$. The existence of such a basis for every finite-dimensional normed space was first proved by
Herman Auerbach [4] (see also [12, 24]). It can easily be shown that an Auerbach basis is an
$(\alpha, \beta, p)$-well-conditioned basis, with $\alpha = d$ and $\beta = 1$ for all $p$. Further, suppose $U$ is an Auerbach basis for
$\mathrm{span}(A)$, where $A$ is an $n \times d$ matrix of rank $d$. Writing $A = U\tau$, it follows that $\tau^{-1}$ is an exact barycentric
spanner for $D = \{z \in \mathbb{R}^d : \|Az\|_p \le 1\}$. Specifically, each $\tau^{-1}_{\star j} \in D$ since $\|A\tau^{-1}_{\star j}\|_p = \|U_{\star j}\|_p = 1$. Now write
$z \in D$ as $z = \tau^{-1}\nu$. Since the vector $y = Az = U\nu$ is in the unit ball of $\mathrm{span}(A)$, we have $|\nu_j| \le 1$ for
all $1 \le j \le d$. Therefore, computing a barycentric spanner for the compact set $D$ (which is the pre-image
of the unit ball of $\mathrm{span}(A)$) is equivalent (up to polynomial factors) to computing an Auerbach basis for
$\mathrm{span}(A)$.

3.2 Subspace-preserving sampling 

In the previous subsection (and in the notation of the proof of Theorem 4), we saw that given $p \in [1, \infty)$,
any $n \times m$ matrix $A$ of rank $d$ can be decomposed as

$A = QR = QG^{-1}GR = U\tau$,

where $U = QG^{-1}$ is a $p$-well-conditioned basis for $\mathrm{span}(A)$ and $\tau = GR$. The significance of a
$p$-well-conditioned basis is that we are able to minimize the variance in our sampling process by randomly
sampling rows of the matrix $A$ and elements of the vector $b$ according to a probability distribution that
depends on norms of the rows of the matrix $U$. This will allow us to preserve the subspace structure of
$\mathrm{span}(A)$ and thus to achieve relative-error approximation guarantees.

More precisely, given $p \in [1, \infty)$ and any $n \times m$ matrix $A$ of rank $d$ decomposed as $A = U\tau$, where
$U$ is an $(\alpha, \beta, p)$-well-conditioned basis for $\mathrm{span}(A)$, consider any set of sampling probabilities $p_i$ for
$i = 1, \ldots, n$, that satisfy:

$p_i \ge \min\left\{1, \frac{\|U_{i\star}\|_p^p}{|||U|||_p^p}\, r\right\}$, (5)

where $r = r(\alpha, \beta, p, d, \epsilon)$ is to be determined below. Let us randomly sample the $i$-th row of $A$ with
probability $p_i$, for all $i = 1, \ldots, n$. Recall that we can construct a diagonal sampling matrix $S$, where each
$S_{ii} = 1/p_i^{1/p}$ with probability $p_i$ and 0 otherwise, in which case we can represent the sampling operation
as $SA$. The following theorem is our main result regarding this subspace-preserving sampling procedure.

Theorem 5. Let $A$ be an $n \times m$ matrix of rank $d$, and let $p \in [1, \infty)$. Let $U$ be an $(\alpha, \beta, p)$-well-conditioned
basis for $\mathrm{span}(A)$, and let us randomly sample rows of $A$ according to the procedure described above using
the probability distribution given by Equation (5), where
$r \ge 32^p (\alpha\beta)^p \left(d \ln(12/\epsilon) + \ln(2/\delta)\right)/(p^2 \epsilon^2)$. Then, with probability $1 - \delta$, the following holds for all
$x \in \mathbb{R}^m$:

$\left|\, \|SAx\|_p - \|Ax\|_p \,\right| \le \epsilon \|Ax\|_p$.

Several things should be noted about this result. First, it implies that $\mathrm{rank}(SA) = \mathrm{rank}(A)$, since
otherwise we could choose a vector $x \in \mathrm{null}(SA)$ and violate the theorem. In this sense, this theorem
generalizes the subspace-preservation result of Lemma 4.1 of [13] to all $p \in [1, \infty)$. Second, regarding
sampling complexity: if $p < 2$ the sampling complexity is $O(d^{p/2+2})$, if $p = 2$ it is $O(d^2)$, and if $p > 2$
it is $O(d(d^{1/p+1/2} d^{1/q-1/2})^p) = O(d^{p+1})$. Finally, note that this theorem is analogous to the main result of
Schechtman [20], which uses the notion of Auerbach bases.
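The row-norm-based sampling probabilities of Equation (5), taken with equality, can be sketched as follows.
The matrix `U` below is an arbitrary small example rather than a basis produced by the construction of
Theorem 4, and the function name and the value of `r` are our assumptions.

```python
def row_norm_probabilities(U, p, r):
    """Sampling probabilities p_i = min(1, r * ||U_i*||_p^p / |||U|||_p^p)."""
    row_mass = [sum(abs(u) ** p for u in row) for row in U]  # ||U_i*||_p^p for each row
    total = sum(row_mass)                                     # |||U|||_p^p
    return [min(1.0, r * m / total) for m in row_mass]


# Illustrative 5 x 2 matrix: two heavy rows, two light rows, one zero row.
U = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [0.1, -0.1], [0.0, 0.0]]
probs = row_norm_probabilities(U, p=1, r=2.0)
expected_sample_size = sum(probs)  # never exceeds r
```

Rows carrying more of the $p$-norm mass of $U$ are sampled with higher probability, and the expected number
of sampled rows is at most $r$ because the unclipped probabilities sum to exactly $r$. A zero row receives
probability 0 here; it contributes nothing to $\|Ax\|_p$, so never sampling it is harmless in this sketch.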



4 The sampling algorithm 

4.1 Statement of our main algorithm and theorem 

Our main sampling algorithm for approximating the solution to the $\ell_p$ regression problem is presented in
Figure 1. The algorithm takes as input an $n \times m$ matrix $A$ of rank $d$, a vector $b \in \mathbb{R}^n$, and a number
$p \in [1, \infty)$. It is a two-stage algorithm that returns as output a vector $\hat{x}_{\mathrm{OPT}} \in \mathbb{R}^m$ (or a vector $\hat{x}_c \in \mathbb{R}^m$ if
only the first stage is run). In either case, the output is the solution to the induced $\ell_p$ regression subproblem
constructed on the randomly sampled constraints.

The algorithm first computes a $p$-well-conditioned basis $U$ for $\mathrm{span}(A)$, as described in the proof of
Theorem 4. Then, in the first stage, the algorithm uses information from the norms of the rows of $U$ to
sample constraints from the input $\ell_p$ regression problem. In particular, roughly $O(d^{k+1})$ rows of $A$, and the
corresponding elements of $b$, are randomly sampled according to the probability distribution given by

$p_i = \min\left\{1, \frac{\|U_{i\star}\|_p^p}{|||U|||_p^p}\, r_1\right\}$, where $r_1 = 8^2 \cdot 36^p d^k \left(d \ln(8 \cdot 36) + \ln(200)\right)$, (6)

implicitly represented by a diagonal sampling matrix $S$, where each $S_{ii} = 1/p_i^{1/p}$. For the remainder of the
paper, we will use $S$ to denote the sampling matrix for the first-stage sampling probabilities. The algorithm
then solves, using any $\ell_p$ solver of one's choice, the smaller subproblem. If the solution to the induced
subproblem is denoted $\hat{x}_c$, then, as we will see in Theorem 6, this is an 8-approximation to the original
problem.⁵



⁴ It has been brought to our attention by an anonymous reviewer that one of the main results of this section
can be obtained with a simpler analysis. In particular, one can show that one can obtain a relative-error (as
opposed to a constant-factor) approximation in one stage, if the sampling probabilities are constructed from
subspace information in the augmented matrix $[A\ b]$ (as opposed to using just subspace information from the
matrix $A$), i.e., by using information in both the data matrix $A$ and the target vector $b$.

⁵ For $p = 2$, Drineas, Mahoney, and Muthukrishnan show that this first stage actually leads to a
$(1 + \epsilon)$-approximation. For $p = 1$, Clarkson develops a subgradient-based algorithm and runs it, after
preprocessing the input, on all the input constraints to obtain a constant-factor approximation in a stage
analogous to our first stage. Here, however, we solve an $\ell_p$ regression problem on a small subset of the
constraints to obtain the constant-factor approximation. Moreover, our procedure works for all $p \in [1, \infty)$.
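Input: An $n \times m$ matrix $A$ of rank $d$, a vector $b \in \mathbb{R}^n$, and $p \in [1, \infty)$.
Let $0 < \epsilon < 1/7$, and define $k = \max\{p/2 + 1, p\}$.

- Find a $p$-well-conditioned basis $U \in \mathbb{R}^{n \times d}$ for $\mathrm{span}(A)$ (as in the proof of Theorem 4).

- Stage 1: Define $p_i = \min\left\{1, \frac{\|U_{i\star}\|_p^p}{|||U|||_p^p}\, r_1\right\}$, where $r_1 = 8^2 \cdot 36^p d^k \left(d \ln(8 \cdot 36) + \ln(200)\right)$.

- Generate (implicitly) $S$ where $S_{ii} = 1/p_i^{1/p}$ with probability $p_i$ and 0 otherwise.

- Let $\hat{x}_c$ be the solution to $\min_{x \in \mathbb{R}^m} \|S(Ax - b)\|_p$.

- Stage 2: Let $\hat{\rho} = A\hat{x}_c - b$, and unless $\hat{\rho} = 0$ define $q_i = \min\left\{1, \max\left\{p_i, \frac{|\hat{\rho}_i|^p}{\|\hat{\rho}\|_p^p}\, r_2\right\}\right\}$ with
  $r_2 = \frac{36^p d^k}{\epsilon^2}\left(d \ln(\frac{36}{\epsilon}) + \ln(200)\right)$.

- Generate (implicitly, a new) $T$ where $T_{ii} = 1/q_i^{1/p}$ with probability $q_i$ and 0 otherwise.

- Let $\hat{x}_{\mathrm{OPT}}$ be the solution to $\min_{x \in \mathbb{R}^m} \|T(Ax - b)\|_p$.

Output: $\hat{x}_{\mathrm{OPT}}$ (or $\hat{x}_c$ if only the first stage is run).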



Figure 1: Sampling algorithm for $\ell_p$ regression.

In the second stage, the algorithm uses information from the residual of the 8-approximation computed
in the first stage to refine the sampling probabilities. Define the residual $\hat{\rho} = A\hat{x}_c - b$ (and note that
$\|\hat{\rho}\|_p \le 8Z$). Then, roughly $O(d^{k+1}/\epsilon^2)$ rows of $A$, and the corresponding elements of $b$, are randomly
sampled according to the probability distribution

$q_i = \min\left\{1, \max\left\{p_i, \frac{|\hat{\rho}_i|^p}{\|\hat{\rho}\|_p^p}\, r_2\right\}\right\}$, where $r_2 = \frac{36^p d^k}{\epsilon^2}\left(d \ln(\frac{36}{\epsilon}) + \ln(200)\right)$. (7)

As before, this can be represented as a diagonal sampling matrix $T$, where each $T_{ii} = 1/q_i^{1/p}$ with
probability $q_i$ and 0 otherwise. For the remainder of the paper, we will use $T$ to denote the sampling matrix
for the second-stage sampling probabilities. Again, the algorithm solves, using any $\ell_p$ solver of one's choice,
the smaller subproblem. If the solution to the induced subproblem at the second stage is denoted $\hat{x}_{\mathrm{OPT}}$, then,
as we will see in Theorem 6, this is a $(1 + \epsilon)$-approximation to the original problem.⁶
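The second-stage probabilities combine the first-stage probabilities with the residual of the first-stage
solution. A minimal Python sketch of that refinement step follows; the function name, the residual values,
and the value of `r2` are made up for illustration and are far smaller than the $r_2$ the analysis requires.

```python
def residual_probabilities(first_stage_probs, residual, p, r2):
    """q_i = min(1, max(p_i, r2 * |rho_i|^p / ||rho||_p^p)); requires a nonzero residual."""
    mass = [abs(v) ** p for v in residual]
    total = sum(mass)  # ||rho||_p^p
    if total == 0.0:
        # Mirrors the "unless rho = 0" clause: the first-stage solution is already exact.
        raise ValueError("residual is zero")
    return [min(1.0, max(pi, r2 * m / total)) for pi, m in zip(first_stage_probs, mass)]


p_i = [0.2, 0.2, 0.2, 0.2]       # first-stage probabilities (illustrative)
rho = [0.0, 1.0, -1.0, 2.0]      # residual A x_c - b (made-up numbers)
q_i = residual_probabilities(p_i, rho, p=2, r2=3.0)
```

Taking the maximum with $p_i$ guarantees $q_i \ge p_i$, so constraints with large residuals are boosted while
no constraint becomes less likely to be sampled than in the first stage.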

The following is our main theorem for the $\ell_p$ regression algorithm presented in Figure 1.

Theorem 6. Let $A$ be an $n \times m$ matrix of rank $d$, let $b \in \mathbb{R}^n$, and let $p \in [1, \infty)$. Recall that
$r_1 = 8^2 \cdot 36^p d^k \left(d \ln(8 \cdot 36) + \ln(200)\right)$ and $r_2 = \frac{36^p d^k}{\epsilon^2}\left(d \ln(\frac{36}{\epsilon}) + \ln(200)\right)$. Then,

• Constant-factor approximation. If only the first stage of the algorithm in Figure 1 is run, then with
probability at least 0.6, the solution $\hat{x}_c$ to the sampled problem based on the $p_i$'s of Equation (6) is an
8-approximation to the $\ell_p$ regression problem;

• Relative-error approximation. If both stages of the algorithm are run, then with probability at
least 0.5, the solution $\hat{x}_{\mathrm{OPT}}$ to the sampled problem based on the $q_i$'s of Equation (7) is a
$(1 + \epsilon)$-approximation to the $\ell_p$ regression problem;

• Running time. The $i$-th stage of the algorithm runs in time $O(nmd + nd^5 \log n + \phi(20 i r_i, m))$, where
$\phi(s, t)$ is the time taken to solve the regression problem $\min_{x \in \mathbb{R}^t} \|A'x - b'\|_p$, where $A' \in \mathbb{R}^{s \times t}$ is of
rank $d$ and $b' \in \mathbb{R}^s$.

⁶ The subspace-based sampling probabilities (6) are similar to those used by Drineas, Mahoney, and
Muthukrishnan [13], while the residual-based sampling probabilities (7) are similar to those used by
Clarkson [11].

Note that since the algorithm of Figure 1 constructs the $(\alpha, \beta, p)$-well-conditioned basis $U$ using the
procedure in the proof of Theorem 4, our sampling complexity depends on $\alpha$ and $\beta$. In particular, it will
be $O(d(\alpha\beta)^p)$. Thus, if $p < 2$ our sampling complexity is $O(d \cdot d^{p/2+1}) = O(d^{p/2+2})$; if $p > 2$ it is
$O(d(d^{1/p+1/2} d^{1/q-1/2})^p) = O(d^{p+1})$; and (although not explicitly stated, our proof will make it clear that) if
$p = 2$ it is $O(d^2)$. Note also that we have stated the claims of the theorem as holding with constant
probability, but they can be shown to hold with probability at least $1 - \delta$ by using standard amplification
techniques.
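The exponent arithmetic behind these sampling complexities can be checked mechanically. The sketch below
encodes the exponents of $d$ in $\alpha$ and $\beta$ from Theorem 4 and evaluates the exponent of $d$ in
$d(\alpha\beta)^p$; the function names are ours.

```python
def alpha_beta_exponents(p):
    """Return (a, b) such that alpha = d^a and beta = d^b, following Theorem 4."""
    if p == 2:
        return 0.5, 0.0                 # orthonormal basis: alpha = sqrt(d), beta = 1
    a = 1.0 / p + 0.5
    if p < 2:
        return a, 0.0
    q = p / (p - 1.0)                   # dual exponent, 1/p + 1/q = 1
    return a, 1.0 / q - 0.5


def sampling_exponent(p):
    """Exponent e such that d * (alpha * beta)^p = d^e."""
    a, b = alpha_beta_exponents(p)
    return 1.0 + p * (a + b)


exponents = {p: sampling_exponent(p) for p in (1.0, 1.5, 2.0, 3.0, 4.0)}
```

For $p < 2$ the exponent is $p/2 + 2$, for $p > 2$ it is $p + 1$ (since $1/p + 1/q = 1$), and for $p = 2$ it is 2,
in agreement with the three cases stated above.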

4.2 Proof for first-stage sampling - constant-factor approximation 

To prove the claims of Theorem 6 having to do with the output of the algorithm after the first stage of
sampling, we begin with two lemmas. First note that, because of our choice of $r_1$, we can use the
subspace-preserving Theorem 5 with only a constant distortion, i.e., for all $x$, we have

$\frac{7}{8}\|Ax\|_p \le \|SAx\|_p \le \frac{9}{8}\|Ax\|_p$

with probability at least 0.99. The first lemma below now states that the optimal solution to the original
problem provides a small (constant-factor) residual when evaluated in the sampled problem.

Lemma 7. $\|S(Ax_{\mathrm{OPT}} - b)\| \le 3Z$, with probability at least $1 - 1/3^p$.

The next lemma states that if the solution to the sampled problem provides a constant-factor approxima- 
tion (when evaluated in the sampled problem), then when this solution is evaluated in the original regression 
problem we get a (slightly weaker) constant-factor approximation. 

Lemma 8. If $\|S(A\hat{x}_c - b)\| \le 3Z$, then $\|A\hat{x}_c - b\| \le 8Z$.

Clearly, $\|S(A\hat{x}_c - b)\| \le \|S(Ax_{\mathrm{OPT}} - b)\|$ (since $\hat{x}_c$ is an optimum for the sampled $\ell_p$ regression
problem). Combining this with Lemmas 7 and 8, it follows that the solution $\hat{x}_c$ to the sampled problem based
on the $p_i$'s of Equation (6) satisfies $\|A\hat{x}_c - b\| \le 8Z$, i.e., $\hat{x}_c$ is an 8-approximation to the original $Z$.

To conclude the proof of the claims for the first stage of sampling, note that by our choice of $r_1$,
Theorem 5 fails to hold for our first-stage sampling with probability no greater than 1/100. In addition,
Lemma 7 fails to hold with probability no greater than $1/3^p$, which is no greater than 1/3 for all
$p \in [1, \infty)$. Finally, let $\hat{r}_1$ be a random variable representing the number of rows actually chosen by our
sampling schema, and note that $E[\hat{r}_1] \le r_1$. By Markov's inequality, it follows that $\hat{r}_1 > 20 r_1$ with
probability less than 1/20. Thus, the first stage of our algorithm fails to give an 8-approximation in the
specified running time with a probability bounded by $1/3 + 1/20 + 1/100 < 2/5$.
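The union bound in this paragraph can be verified with exact rational arithmetic. The following sketch
simply adds the three failure probabilities named above; the function name is ours.

```python
from fractions import Fraction


def first_stage_failure_bound(p):
    """Union bound over the three failure events of the first stage (p a positive integer)."""
    subspace_fails = Fraction(1, 100)       # Theorem 5 fails for the first-stage sample
    lemma7_fails = 1 / Fraction(3) ** p     # 1 / 3^p, at most 1/3 when p >= 1
    too_many_rows = Fraction(1, 20)         # Markov: more than 20 r_1 rows are sampled
    return subspace_fails + lemma7_fails + too_many_rows


worst_case = first_stage_failure_bound(1)   # p = 1 maximises the 1 / 3^p term
```

At $p = 1$ the bound is $59/150$, just below $2/5 = 60/150$, which is where the success probability of at
least 0.6 in Theorem 6 comes from.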

4.3 Proof for second-stage sampling - relative-error approximation 

The proof of the claims of Theorem 6 having to do with the output of the algorithm after the second stage
of sampling will parallel that for the first stage, but it will have several technical complexities that arise
since the first triangle inequality approximation in the proof of Lemma 8 is too coarse for relative-error
approximation. By our choice of $r_2$ again, we have a finer result for subspace preservation. Thus, with
probability 0.99, the following holds for all $x$:

$(1 - \epsilon)\|Ax\|_p \le \|SAx\|_p \le (1 + \epsilon)\|Ax\|_p$.

As before, we start with a lemma that states that the optimal solution to the original problem provides a small
(now a relative-error) residual when evaluated in the sampled problem. This is the analog of Lemma 7. An
important difference is that the second-stage sampling probabilities significantly enhance the probability of
success.

Lemma 9. $\|T(Ax_{\mathrm{OPT}} - b)\| \le (1 + \epsilon)Z$, with probability at least 0.99.

Next we show that if the solution to the sampled problem provides a relative-error approximation (when 
evaluated in the sampled problem), then when this solution is evaluated in the original regression problem 
we get a (slightly weaker) relative-error approximation. We first establish two technical lemmas. 

The following lemma says that for all optimal solutions $\hat{x}_{\mathrm{OPT}}$ to the second-stage sampled problem,
$A\hat{x}_{\mathrm{OPT}}$ is not too far from $A\hat{x}_c$, where $\hat{x}_c$ is the optimal solution from the first stage, in a $p$-norm sense.
Hence, the lemma will allow us to restrict our calculations in Lemmas 11 and 12 to the ball of radius $12Z$
centered at $A\hat{x}_c$.

Lemma 10. H^iopT — ^^dl <12Z. 

Thus, if we define the affine ball of radius $12Z$ that is centered at $A\hat{x}_c$ and that lies in $\mathrm{span}(A)$,

$$B = \{ y \in \mathbb{R}^n : y = Ax,\ x \in \mathbb{R}^m,\ \|A\hat{x}_c - y\| \le 12Z \} , \qquad (8)$$

then Lemma 10 states that $A\hat{x}_{\mathrm{OPT}} \in B$, for all optimal solutions $\hat{x}_{\mathrm{OPT}}$ to the sampled problem. Let us
consider an $\epsilon$-net, call it $B_\epsilon$, of resolution $\epsilon Z$, for this ball $B$. Using standard arguments, the size of the
$\epsilon$-net is $\left(\frac{3 \cdot 12Z}{\epsilon Z}\right)^d = \left(\frac{36}{\epsilon}\right)^d$. The next lemma states that for all points in the $\epsilon$-net, if that point provides a
relative-error approximation (when evaluated in the sampled problem), then when this point is evaluated in
the original regression problem we get a (slightly weaker) relative-error approximation.
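As a rough numerical illustration of the net-size count, the sketch below evaluates the standard volumetric bound $(3R/s)^d$ for a ball of radius $R = 12Z$ covered at resolution $s = \epsilon Z$. The function name `net_size_bound` and the sample values of `d` and `eps` are illustrative assumptions, not part of the analysis.

```python
def net_size_bound(radius, resolution, dim):
    """Standard volumetric bound (3 * radius / resolution) ** dim on the size of a net."""
    return (3.0 * radius / resolution) ** dim

Z = 2.5      # optimal value; any positive number works because it cancels
eps = 0.1    # hypothetical accuracy parameter
d = 5        # hypothetical rank of A

size = net_size_bound(12 * Z, eps * Z, d)
closed_form = (36.0 / eps) ** d   # the same count written as (36 / eps) ** d
```

Because $Z$ cancels in the ratio, the count depends only on $\epsilon$ and $d$.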

Lemma 11. For all points $Ax_\epsilon$ in the $\epsilon$-net $B_\epsilon$, if $\|T(Ax_\epsilon - b)\| \le (1+3\epsilon)Z$, then $\|Ax_\epsilon - b\| \le (1+6\epsilon)Z$,
with probability 0.99.

Finally, the next lemma states that if the solution to the sampled problem (in the second stage of sampling)
provides a relative-error approximation (when evaluated in the sampled problem), then when this
solution is evaluated in the original regression problem we get a (slightly weaker) relative-error approximation.
This is the analog of Lemma 8, and its proof will use Lemma 11.

Lemma 12. If $\|T(A\hat{x}_{\mathrm{OPT}} - b)\| \le (1+\epsilon)Z$, then $\|A\hat{x}_{\mathrm{OPT}} - b\| \le (1+7\epsilon)Z$.

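Clearly, $\|T(A\hat{x}_{\mathrm{OPT}} - b)\| \le \|T(Ax_{\mathrm{OPT}} - b)\|$, since $\hat{x}_{\mathrm{OPT}}$ is an optimum for the sampled $\ell_p$ regression
problem. Combining this with Lemmas 9 and 12, it follows that the solution $\hat{x}_{\mathrm{OPT}}$ to the sampled problem
based on the $q_i$'s of Equation (7) satisfies $\|A\hat{x}_{\mathrm{OPT}} - b\| \le (1+\epsilon)Z$ (after rescaling $\epsilon$ by a constant factor to
absorb the $7\epsilon$ of Lemma 12), i.e., $\hat{x}_{\mathrm{OPT}}$ is a $(1+\epsilon)$-approximation to the original $Z$.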

To conclude the proof of the claims for the second stage of sampling, recall that the first stage failed
with probability no greater than $2/5$. Note also that by our choice of $r_2$, Theorem 5 fails to hold for our
second stage sampling with probability no greater than $1/100$. In addition, Lemma 9 and Lemma 11 each
fail to hold with probability no greater than $1/100$. Finally, let $\hat{r}_2$ be a random variable representing the
number of rows actually chosen by our sampling schema in the second stage, and note that $E[\hat{r}_2] \le 2r_2$. By
Markov's inequality, it follows that $\hat{r}_2 > 40 r_2$ with probability less than $1/20$. Thus, the second stage of
our algorithm fails with probability less than $1/20 + 1/100 + 1/100 + 1/100 < 1/10$. By combining both
stages, our algorithm fails to give a $(1+\epsilon)$-approximation in the specified running time with a probability
bounded from above by $2/5 + 1/10 = 1/2$.
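The failure-probability accounting for the two stages can be checked with exact rational arithmetic. The sketch below only re-adds the individual bounds quoted in the text; the variable names are illustrative.

```python
from fractions import Fraction as F

# First stage: Lemma 7 (at most 1/3), Markov bound on the sample size (1/20), Theorem 5 (1/100).
stage1 = F(1, 3) + F(1, 20) + F(1, 100)
# Second stage: Markov bound on the sample size (1/20), then Theorem 5, Lemma 9, Lemma 11 (1/100 each).
stage2 = F(1, 20) + 3 * F(1, 100)

stage1_budget = F(2, 5)    # bound claimed for the first stage
stage2_budget = F(1, 10)   # bound claimed for the second stage
total_budget = stage1_budget + stage2_budget
```

Both per-stage sums fall strictly below their budgets, and the budgets themselves add up to exactly $1/2$.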

5 Extensions 

In this section we outline several immediate extensions of our main algorithmic result. 

Constrained $\ell_p$ regression. Our sampling strategies are transparent to constraints placed on $x$. In particular,
suppose we constrain the output of our algorithm to lie within a convex set $C \subseteq \mathbb{R}^m$. If there is an
algorithm to solve the constrained $\ell_p$ regression problem $\min_{x \in C} \|A'x - b'\|_p$, where $A' \in \mathbb{R}^{s \times m}$ is of rank
$d$ and $b' \in \mathbb{R}^s$, in time $\phi(s, m)$, then by modifying our main algorithm in a straightforward manner, we
can obtain an algorithm that gives a $(1+\epsilon)$-approximation to the constrained $\ell_p$ regression problem in time
$O(nmd + nd^5 \log n + \phi(40 r_2, m))$.

Generalized $\ell_p$ regression. Our sampling strategies extend to the case of generalized $\ell_p$ regression: given
as input a matrix $A \in \mathbb{R}^{n \times m}$ of rank $d$, a target matrix $B \in \mathbb{R}^{n \times p}$, and a real number $p \in [1, \infty)$, find a
matrix $X \in \mathbb{R}^{m \times p}$ such that $|\!|\!|AX - B|\!|\!|_p$ is minimized. To do so, we generalize our sampling strategies in
a straightforward manner. The probabilities $p_i$ for the first stage of sampling are the same as before. Then,
if $\hat{X}_c$ is the solution to the first-stage sampled problem, we can define the $n \times p$ matrix $\hat{\rho} = A\hat{X}_c - B$,
and define the second stage sampling probabilities to be $q_i = \min\left(1, \max\{p_i,\ r_2 \|\hat{\rho}_{i\star}\|_p^p / |\!|\!|\hat{\rho}|\!|\!|_p^p\}\right)$. Then,
we can show that the $\hat{X}_{\mathrm{OPT}}$ computed from the second-stage sampled problem satisfies $|\!|\!|A\hat{X}_{\mathrm{OPT}} - B|\!|\!|_p \le
(1+\epsilon) \min_{X \in \mathbb{R}^{m \times p}} |\!|\!|AX - B|\!|\!|_p$, with probability at least $1/2$.
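The second-stage probabilities for the generalized problem can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `second_stage_probs`, the toy residual matrix, and the fallback used when the residual is identically zero are hypothetical choices, not something the text prescribes.

```python
def second_stage_probs(residual_rows, first_stage_probs, r2, p):
    """q_i = min(1, max(p_i, r2 * ||rho_i||_p^p / |||rho|||_p^p)) for each row i."""
    # p-th power of the p-norm of each residual row
    row_mass = [sum(abs(v) ** p for v in row) for row in residual_rows]
    total = sum(row_mass)  # p-th power of the entrywise matrix p-norm
    if total == 0:
        # Assumption: with a zero residual, fall back to the first-stage probabilities.
        return [min(1.0, pi) for pi in first_stage_probs]
    return [min(1.0, max(pi, r2 * m / total))
            for pi, m in zip(first_stage_probs, row_mass)]

rho = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical 3 x 2 residual matrix
q = second_stage_probs(rho, [0.1, 0.1, 0.1], r2=2, p=1)
```

Rows with a large share of the residual are kept with probability 1, while rows with no residual retain their first-stage probability.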

Weighted $\ell_p$ regression. Our sampling strategies also generalize to the case of $\ell_p$ regression involving
weighted $p$-norms: if $w_1, \ldots, w_m$ are a set of non-negative weights then the weighted $p$-norm of a vector
$x \in \mathbb{R}^m$ may be defined as $\|x\|_{p,w} = \left(\sum_{i=1}^m w_i |x_i|^p\right)^{1/p}$, and the weighted analog of the matrix $p$-norm
$|\!|\!|\cdot|\!|\!|_p$ may be defined as $|\!|\!|U|\!|\!|_{p,w} = \left(\sum_{j=1}^d \|U_{\star j}\|_{p,w}^p\right)^{1/p}$. Our sampling schema proceeds as before. First,
we compute a "well-conditioned" basis $U$ for $\mathrm{span}(A)$ with respect to this weighted $p$-norm. The sampling
probabilities $p_i$ for the first stage of the algorithm are then $p_i = \min\left(1,\ r_1 w_i \|U_{i\star}\|_p^p / |\!|\!|U|\!|\!|_{p,w}^p\right)$, and the
sampling probabilities $q_i$ for the second stage are $q_i = \min\left(1, \max\{p_i,\ r_2 w_i |\hat{\rho}_i|^p / \|\hat{\rho}\|_{p,w}^p\}\right)$, where $\hat{\rho}$ is the
residual from the first stage.

General sampling probabilities. More generally, consider any sampling probabilities of the form

$$p_i \ge \min\left\{1,\ \max\left\{\frac{\|U_{i\star}\|_p^p}{|\!|\!|U|\!|\!|_p^p},\ \frac{|(\rho_{\mathrm{OPT}})_i|^p}{\|\rho_{\mathrm{OPT}}\|_p^p}\right\} r\right\},$$

where $\rho_{\mathrm{OPT}} = Ax_{\mathrm{OPT}} - b$ and $r \ge \frac{36^p d^k}{\epsilon^2}\left(d \ln\left(\frac{36}{\epsilon}\right) + \ln(200)\right)$ with $k = \max\{p/2+1, p\}$, and
where we adopt the convention that $\frac{0}{0} = 0$. Then, by an analysis similar to that presented for our two
stage algorithm, we can show that, by picking $\Theta(36^p d^{k+1}/\epsilon^2)$ rows of $A$ and the corresponding elements
of $b$ (in a single stage of sampling) according to these probabilities, the solution $\hat{x}_{\mathrm{OPT}}$ to the sampled $\ell_p$
regression problem is a $(1+\epsilon)$-approximation to the original problem, with probability at least $1/2$. (Note
that these sampling probabilities, if an equality is used in this expression, depend on the entries of the vector
$\rho_{\mathrm{OPT}} = Ax_{\mathrm{OPT}} - b$; in particular, they require the solution of the original problem. This is reminiscent of
the results of [13]. Our main two-stage algorithm shows that by solving a problem in the first stage based
on coarse probabilities, we can refine our probabilities to approximate these probabilities and thus obtain a
$(1+\epsilon)$-approximation to the $\ell_p$ regression problem more efficiently.)
A Tail inequalities

With respect to tail inequalities, we will use the following version of Bernstein's inequality.

Theorem 13. Let $\{X_i\}_{i=1}^n$ be independent random variables with $E[X_i^2] < \infty$ and $X_i \ge 0$. Set
$Y = \sum_i X_i$ and let $\gamma > 0$. Then

$$\Pr[Y \le E[Y] - \gamma] \le \exp\left(\frac{-\gamma^2}{2\sum_i E[X_i^2]}\right) . \qquad (9)$$

If $X_i - E[X_i] \le \Delta$ for all $i$, then with $\sigma_i^2 = E[X_i^2] - E[X_i]^2$ we have

$$\Pr[Y \ge E[Y] + \gamma] \le \exp\left(\frac{-\gamma^2}{2\sum_i \sigma_i^2 + 2\gamma\Delta/3}\right) . \qquad (10)$$
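To make the two tail bounds concrete, the sketch below evaluates them numerically. It assumes the standard Bernstein-type forms $\exp(-\gamma^2 / (2\sum_i E[X_i^2]))$ for the lower tail and $\exp(-\gamma^2 / (2\sum_i \sigma_i^2 + 2\gamma\Delta/3))$ for the upper tail, which is how Equations (9) and (10) are used in the proofs that follow. The function names and the Bernoulli example are illustrative.

```python
import math

def lower_tail_bound(gamma, second_moments):
    """Bound on Pr[Y <= E[Y] - gamma]: exp(-gamma^2 / (2 * sum E[X_i^2]))."""
    return math.exp(-gamma ** 2 / (2.0 * sum(second_moments)))

def upper_tail_bound(gamma, variances, delta):
    """Bound on Pr[Y >= E[Y] + gamma]: exp(-gamma^2 / (2 * sum sigma_i^2 + 2 * gamma * delta / 3))."""
    return math.exp(-gamma ** 2 / (2.0 * sum(variances) + 2.0 * gamma * delta / 3.0))

# Hypothetical example: 100 independent Bernoulli(1/2) variables, deviation gamma = 10.
n = 100
lo = lower_tail_bound(10.0, [0.5] * n)        # E[X_i^2] = 1/2
hi = upper_tail_bound(10.0, [0.25] * n, 0.5)  # sigma_i^2 = 1/4 and X_i - E[X_i] <= 1/2
```

In this example the upper-tail bound is the smaller of the two, since it uses variances rather than raw second moments.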

B Proofs for Section 3
B.1 Proof of Theorem 5

Proof. For simplicity of presentation, in this proof we will generally drop the subscript from our matrix and
vector $p$-norms; i.e., unsubscripted norms will be $p$-norms. Note that it suffices to prove that, for all $x \in \mathbb{R}^m$,

$$(1-\epsilon)^p \|Ax\|^p \le \|SAx\|^p \le (1+\epsilon)^p \|Ax\|^p , \qquad (11)$$

with probability $1 - \delta$. To this end, fix a vector $x \in \mathbb{R}^m$, define the random variable $X_i = (S_{ii}|A_{i\star}x|)^p$, and
recall that $A_{i\star} = U_{i\star}\tau$ since $A = U\tau$. Clearly, $\sum_{i=1}^n X_i = \|SAx\|^p$. In addition, since $E[X_i] = |A_{i\star}x|^p$, it
follows that $\sum_{i=1}^n E[X_i] = \|Ax\|^p$. To bound Equation (11), first note that

$$\sum_{i=1}^n (X_i - E[X_i]) = \sum_{i: p_i < 1} (X_i - E[X_i]) . \qquad (12)$$

Equation (12) follows since, according to the definition of $p_i$ in Equation (5), $p_i$ may equal 1 for some rows,
and since these rows are always included in the random sample, $X_i = E[X_i]$ for these rows. To bound the
right hand side of Equation (12), note that for all $i$ such that $p_i < 1$,

$$\begin{aligned}
|A_{i\star}x|^p / p_i &\le \|U_{i\star}\|_p^p \|\tau x\|_q^p / p_i && \text{(by H\"older's inequality)} \\
&\le |\!|\!|U|\!|\!|_p^p \|\tau x\|_q^p / r && \text{(by Equation (5))} \\
&\le (\alpha\beta)^p \|Ax\|^p / r && \text{(by Definition 3 and Theorem 4)} . \qquad (13)
\end{aligned}$$

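From Equation (13), it follows that for each $i$ such that $p_i < 1$,

$$X_i - E[X_i] \le X_i \le |A_{i\star}x|^p / p_i \le (\alpha\beta)^p \|Ax\|^p / r .$$

Thus, we may define $\Delta = (\alpha\beta)^p \|Ax\|^p / r$. In addition, it also follows from Equation (13) that

$$\begin{aligned}
\sum_{i: p_i < 1} E[X_i^2] &= \sum_{i: p_i < 1} |A_{i\star}x|^p \, \frac{|A_{i\star}x|^p}{p_i} \\
&\le \frac{(\alpha\beta)^p \|Ax\|^p}{r} \sum_{i: p_i < 1} |A_{i\star}x|^p && \text{(by Equation (13))} \\
&\le (\alpha\beta)^p \|Ax\|^{2p} / r ,
\end{aligned}$$

from which it follows that $\sum_{i: p_i < 1} \sigma_i^2 \le \sum_{i: p_i < 1} E[X_i^2] \le (\alpha\beta)^p \|Ax\|^{2p} / r$.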
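The moment identities used above, $E[X_i] = |A_{i\star}x|^p$ and $E[X_i^2] = |A_{i\star}x|^{2p}/p_i$, can be verified exactly for a single row. The sketch assumes the rescaled sampling rule implied by the proof, namely that $S_{ii} = p_i^{-1/p}$ with probability $p_i$ and $S_{ii} = 0$ otherwise; the function name `moments` and the sample numbers are illustrative.

```python
def moments(a_abs, prob, p):
    """Exact E[X] and E[X^2] for X = (S * a_abs)^p, where S = prob**(-1/p) w.p. prob, else 0."""
    x_if_kept = (a_abs / prob ** (1.0 / p)) ** p   # equals a_abs**p / prob
    mean = prob * x_if_kept                        # the zero branch contributes nothing
    second = prob * x_if_kept ** 2
    return mean, second

# Hypothetical row with |A_i x| = 2, sampling probability 1/4, and p = 2.
mean, second = moments(a_abs=2.0, prob=0.25, p=2)
```

Here the mean equals $2^2 = 4$ regardless of the sampling probability, while the second moment $2^4 / 0.25 = 64$ grows as the probability shrinks, which is why the bound on $|A_{i\star}x|^p / p_i$ in Equation (13) is needed.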

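To apply the upper tail bound in Theorem 13, define $\gamma = ((1+\epsilon/4)^p - 1)\|Ax\|^p$. It follows that
$\gamma^2 \ge (p\epsilon/4)^2 \|Ax\|^{2p}$ and also that

$$\begin{aligned}
2\sum_{i: p_i < 1} \sigma_i^2 + 2\gamma\Delta/3 &\le 2(\alpha\beta)^p \|Ax\|^{2p}/r + 2((1+\epsilon/4)^p - 1)(\alpha\beta)^p \|Ax\|^{2p}/(3r) \\
&\le 32^p (\alpha\beta)^p \|Ax\|^{2p}/r ,
\end{aligned}$$

where the second inequality follows by standard manipulations since $\epsilon \le 1$ and since $p \ge 1$. Thus, by
Equation (10) of Theorem 13, it follows that

$$\Pr\left[\|SAx\|^p > \|Ax\|^p + \gamma\right] = \Pr\left[\sum_{i: p_i < 1} X_i > E\left[\sum_{i: p_i < 1} X_i\right] + \gamma\right] \le \exp\left(\frac{-\gamma^2}{2\sum_{i: p_i < 1} \sigma_i^2 + 2\gamma\Delta/3}\right) .$$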

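Similarly, to apply the lower tail bound of Equation (9) of Theorem 13, define $\gamma = (1 - (1-\epsilon/4)^p)\|Ax\|^p$.
Since $\gamma \ge \epsilon \|Ax\|^p / 4$, we can follow a similar line of reasoning to show that

$$\Pr\left[\|SAx\|^p < \|Ax\|^p - \gamma\right] \le \exp\left(\frac{-\gamma^2}{2\sum_{i: p_i < 1} E[X_i^2]}\right) \le \exp\left(\frac{-\epsilon^2 r}{32 (\alpha\beta)^p}\right) .$$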

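Choosing $r \ge 32^p (\alpha\beta)^p (d \ln(12/\epsilon) + \ln(2/\delta)) / (p^2\epsilon^2)$, we get that for every fixed $x$, the following is true with
probability at least $1 - (\epsilon/12)^d \delta$:

$$(1-\epsilon/4)^p \|Ax\|^p \le \|SAx\|^p \le (1+\epsilon/4)^p \|Ax\|^p .$$

Now, consider the ball $B = \{y \in \mathbb{R}^n : y = Ax, \|y\| \le 1\}$ and consider an $\epsilon$-net for $B$ of resolution $\epsilon/4$.
The number of points in the $\epsilon$-net is $(12/\epsilon)^d$. Thus, by the union bound, with probability $1 - \delta$, Equation (11)
holds for all points in the $\epsilon$-net. Now, to show that with the same probability Equation (11) holds for all
points $y \in B$, let $y^* \in B$ be such that $|\,\|Sy\| - \|y\|\,|$ is maximized, and let $\eta = \sup\{|\,\|Sy\| - \|y\|\,| : y \in B\}$.
Also, let $y^*_\epsilon \in B$ be the point in the $\epsilon$-net that is closest to $y^*$. By the triangle inequality,

$$\begin{aligned}
\eta = |\,\|Sy^*\| - \|y^*\|\,| &= |\,\|Sy^*_\epsilon + S(y^* - y^*_\epsilon)\| - \|y^*_\epsilon + (y^* - y^*_\epsilon)\|\,| \\
&\le |\,\|Sy^*_\epsilon\| + \|S(y^* - y^*_\epsilon)\| - \|y^*_\epsilon\| + 2\|y^* - y^*_\epsilon\| - \|y^* - y^*_\epsilon\|\,| \\
&\le |\,\|Sy^*_\epsilon\| - \|y^*_\epsilon\|\,| + |\,\|S(y^* - y^*_\epsilon)\| - \|y^* - y^*_\epsilon\|\,| + 2\|y^* - y^*_\epsilon\| \\
&\le \epsilon/4 \, \|y^*_\epsilon\| + \epsilon\eta/4 + \epsilon/2 ,
\end{aligned}$$

where the last inequality follows since $\|y^* - y^*_\epsilon\| \le \epsilon/4$, $4(y^* - y^*_\epsilon)/\epsilon \in B$, and

$$|\,\|S \cdot 4(y^* - y^*_\epsilon)/\epsilon\| - \|4(y^* - y^*_\epsilon)/\epsilon\|\,| \le \eta .$$

Therefore, $\eta \le \epsilon$ since $\|y^*_\epsilon\| \le 1$ and since we assume $\epsilon \le 1/7$. Thus, Equation (11) holds for all points
$y \in B$, with probability at least $1 - \delta$. Similarly, it holds for any $y \in \mathbb{R}^n$ such that $y = Ax$, since
$y/\|y\| \in B$ and since $|\,\|S(y/\|y\|)\| - \|y/\|y\|\,\|\,| \le \epsilon$ implies that $|\,\|Sy\| - \|y\|\,| \le \epsilon \|y\|$, which completes the
proof of the theorem. □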
C Proofs for Section 4

As in the proof of Theorem 5, unsubscripted norms will be $p$-norms.



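C.1 Proof of Lemma 7

Proof. Define $X_i = (S_{ii}|A_{i\star}x_{\mathrm{OPT}} - b_i|)^p$. Thus, $\sum_i X_i = \|S(Ax_{\mathrm{OPT}} - b)\|^p$, and the first moment is
$E[\sum_i X_i] = \|Ax_{\mathrm{OPT}} - b\|^p = Z^p$. The lemma follows since, by Markov's inequality,

$$\Pr\left[\sum_i X_i > 3^p \, E\left[\sum_i X_i\right]\right] \le \frac{1}{3^p} ,$$

i.e., $\|S(Ax_{\mathrm{OPT}} - b)\|^p > 3^p \|Ax_{\mathrm{OPT}} - b\|^p$, with probability no more than $1/3^p$. □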



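C.2 Proof of Lemma 8

Proof. We will prove the contrapositive: If $\|A\hat{x}_c - b\| > 8Z$, then $\|S(A\hat{x}_c - b)\| > 3Z$. To do so, note
that, by Theorem 5 and the choice of $r_1$, we have that

$$\frac{7}{8}\|Ax\|_p \le \|SAx\|_p \le \frac{9}{8}\|Ax\|_p .$$

Using this,

$$\begin{aligned}
\|S(A\hat{x}_c - b)\| &\ge \|SA(\hat{x}_c - x_{\mathrm{OPT}})\| - \|S(Ax_{\mathrm{OPT}} - b)\| && \text{(by the triangle inequality)} \\
&\ge \frac{7}{8}\|A\hat{x}_c - Ax_{\mathrm{OPT}}\| - 3Z && \text{(by Theorem 5 and Lemma 7)} \\
&\ge \frac{7}{8}\left(\|A\hat{x}_c - b\| - \|Ax_{\mathrm{OPT}} - b\|\right) - 3Z && \text{(by the triangle inequality)} \\
&> \frac{7}{8}(8Z - Z) - 3Z && \text{(by the premise } \|A\hat{x}_c - b\| > 8Z\text{)} \\
&> 3Z ,
\end{aligned}$$

which establishes the lemma. □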
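For concreteness, the arithmetic behind the last two steps of this chain can be written out. This is only a check of the constants $7/8$, $8$, and $3$ used in the proof, under the premise $\|A\hat{x}_c - b\| > 8Z$ and with $\|Ax_{\mathrm{OPT}} - b\| = Z$:

```latex
\frac{7}{8}\,(8Z - Z) - 3Z \;=\; \frac{49}{8}\,Z - \frac{24}{8}\,Z \;=\; \frac{25}{8}\,Z \;>\; 3Z .
```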
C.3 Proof of Lemma 9

Proof. Define the random variable $X_i = (T_{ii}|A_{i\star}x_{\mathrm{OPT}} - b_i|)^p$, and recall that $A_{i\star} = U_{i\star}\tau$ since $A =
U\tau$. Clearly, $\sum_{i=1}^n X_i = \|T(Ax_{\mathrm{OPT}} - b)\|^p$. In addition, since $E[X_i] = |A_{i\star}x_{\mathrm{OPT}} - b_i|^p$, it follows
that $\sum_{i=1}^n E[X_i] = \|Ax_{\mathrm{OPT}} - b\|^p$. We will use Equation (10) of Theorem 13 to provide a bound for

$$\sum_i (X_i - E[X_i]) = \|T(Ax_{\mathrm{OPT}} - b)\|^p - \|Ax_{\mathrm{OPT}} - b\|^p .$$

From the definition of $q_i$ in Equation (7), it follows that for some of the rows, $q_i$ may equal 1 (just as in
the proof of Theorem 5). Since $X_i = E[X_i]$ for these rows, $\sum_i (X_i - E[X_i]) = \sum_{i: q_i < 1} (X_i - E[X_i])$,
and thus we will bound this latter quantity with Equation (10). To do so, we must first provide a bound for
$X_i - E[X_i] \le X_i$ and for $\sum_{i: q_i < 1} \sigma_i^2 \le \sum_i E[X_i^2]$. To that end, note that:

$$\begin{aligned}
|A_{i\star}(x_{\mathrm{OPT}} - \hat{x}_c)| &\le \|U_{i\star}\|_p \|\tau(x_{\mathrm{OPT}} - \hat{x}_c)\|_q && \text{(by H\"older's inequality)} \\
&\le \|U_{i\star}\|_p \, \beta \, \|U\tau(x_{\mathrm{OPT}} - \hat{x}_c)\|_p && \text{(by Definition 3 and Theorem 4)} \\
&\le \|U_{i\star}\|_p \, \beta \left(\|Ax_{\mathrm{OPT}} - b\| + \|A\hat{x}_c - b\|\right) && \text{(by the triangle inequality)} \\
&\le \|U_{i\star}\|_p \, \beta \, 9Z , && (14)
\end{aligned}$$

where the final inequality follows from the definition of $Z$ and the results from the first stage of sampling.
Next, note that from the conditions on the probabilities $q_i$ in Equation (7), as well as by Definition 3 and the
output of the first stage of sampling, it follows that

$$\frac{|\hat{\rho}_i|^p}{q_i} \le \frac{\|\hat{\rho}\|_p^p}{r_2} \le \frac{8^p Z^p}{r_2} \qquad \text{and} \qquad \frac{\|U_{i\star}\|_p^p}{q_i} \le \frac{|\!|\!|U|\!|\!|_p^p}{r_2} \le \frac{\alpha^p}{r_2} , \qquad (15)$$

for all $i$ such that $q_i < 1$.

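Thus, since $X_i - E[X_i] \le X_i \le |A_{i\star}x_{\mathrm{OPT}} - b_i|^p / q_i$, it follows that for all $i$ such that $q_i < 1$,

$$\begin{aligned}
X_i - E[X_i] &\le \frac{2^{p-1}}{q_i}\left(|A_{i\star}(x_{\mathrm{OPT}} - \hat{x}_c)|^p + |\hat{\rho}_i|^p\right) && \text{(since } \hat{\rho} = A\hat{x}_c - b\text{)} && (16) \\
&\le 2^{p-1}\left(\frac{\|U_{i\star}\|_p^p \, \beta^p 9^p Z^p}{q_i} + \frac{|\hat{\rho}_i|^p}{q_i}\right) && \text{(by Equation (14))} \\
&\le 2^{p-1}\left(\alpha^p \beta^p 9^p Z^p + 8^p Z^p\right) / r_2 && \text{(by Equation (15))} \\
&\le c_p (\alpha\beta)^p Z^p / r_2 , && && (17)
\end{aligned}$$

where we set $c_p = 2^{p-1}(9^p + 8^p)$. Thus, we may define $\Delta = c_p (\alpha\beta)^p Z^p / r_2$. In addition, it follows that

$$\begin{aligned}
\sum_{i: q_i < 1} E[X_i^2] &= \sum_{i: q_i < 1} |A_{i\star}x_{\mathrm{OPT}} - b_i|^p \, \frac{|A_{i\star}x_{\mathrm{OPT}} - b_i|^p}{q_i} \\
&\le \Delta \sum_i |A_{i\star}x_{\mathrm{OPT}} - b_i|^p && \text{(by Equation (17))} \\
&\le c_p (\alpha\beta)^p Z^{2p} / r_2 . && (18)
\end{aligned}$$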

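To apply the upper tail bound of Equation (10) of Theorem 13, define $\gamma = ((1+\epsilon)^p - 1)Z^p$. We have
$\gamma \ge p\epsilon Z^p$, and since $\epsilon \le 1/7$, we also have $\gamma \le \left((8/7)^p - 1\right) Z^p$. Hence, by Equation (10) of Theorem 13,
it follows that

$$\ln \Pr\left[\|T(Ax_{\mathrm{OPT}} - b)\|^p > \|Ax_{\mathrm{OPT}} - b\|^p + \gamma\right] \le \frac{-\gamma^2}{2\sum_{i: q_i < 1} \sigma_i^2 + 2\gamma\Delta/3} \le \frac{-p^2\epsilon^2 r_2}{36^p (\alpha\beta)^p} .$$

Thus, $\Pr\left[\|T(Ax_{\mathrm{OPT}} - b)\| > (1+\epsilon)Z\right] \le \exp\left(\frac{-p^2\epsilon^2 r_2}{36^p (\alpha\beta)^p}\right)$, from which the lemma follows by our choice
of $r_2$. □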
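C.4 Proof of Lemma 10

Proof. By two applications of the triangle inequality, it follows that

$$\begin{aligned}
\|A\hat{x}_{\mathrm{OPT}} - A\hat{x}_c\| &\le \|A\hat{x}_{\mathrm{OPT}} - Ax_{\mathrm{OPT}}\| + \|Ax_{\mathrm{OPT}} - b\| + \|A\hat{x}_c - b\| \\
&\le \|A\hat{x}_{\mathrm{OPT}} - Ax_{\mathrm{OPT}}\| + 9Z ,
\end{aligned}$$

where the second inequality follows since $\|A\hat{x}_c - b\| \le 8Z$ from the first stage of sampling and since
$Z = \|Ax_{\mathrm{OPT}} - b\|$. In addition, we have that

$$\begin{aligned}
\|Ax_{\mathrm{OPT}} - A\hat{x}_{\mathrm{OPT}}\| &\le \frac{1}{1-\epsilon}\|T(A\hat{x}_{\mathrm{OPT}} - Ax_{\mathrm{OPT}})\| && \text{(by Theorem 5)} \\
&\le \frac{1}{1-\epsilon}\left(\|T(A\hat{x}_{\mathrm{OPT}} - b)\| + \|T(Ax_{\mathrm{OPT}} - b)\|\right) && \text{(by the triangle inequality)} \\
&\le \frac{2}{1-\epsilon}\|T(Ax_{\mathrm{OPT}} - b)\| \\
&\le \frac{2(1+\epsilon)}{1-\epsilon}\|Ax_{\mathrm{OPT}} - b\| && \text{(by Lemma 9)} ,
\end{aligned}$$

where the third inequality follows since $\hat{x}_{\mathrm{OPT}}$ is optimal for the sampled problem. The lemma follows since
$\epsilon \le 1/7$ gives $2(1+\epsilon)/(1-\epsilon) \le 8/3 < 3$, so that $\|A\hat{x}_{\mathrm{OPT}} - A\hat{x}_c\| \le 3Z + 9Z = 12Z$. □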

C.5 Proof of Lemma 11

Proof. Fix a given point $y^*_\epsilon = Ax^*_\epsilon \in B_\epsilon$. We will prove the contrapositive for this point, i.e., we will prove
that if $\|Ax^*_\epsilon - b\| > (1+6\epsilon)Z$, then $\|T(Ax^*_\epsilon - b)\| > (1+3\epsilon)Z$, with probability at least $1 - \frac{1}{100}\left(\frac{\epsilon}{36}\right)^d$.
The lemma will then follow from the union bound.

To this end, define the random variable $X_i = (T_{ii}|A_{i\star}x^*_\epsilon - b_i|)^p$, and recall that $A_{i\star} = U_{i\star}\tau$ since
$A = U\tau$. Clearly, $\sum_{i=1}^n X_i = \|T(Ax^*_\epsilon - b)\|^p$. In addition, since $E[X_i] = |A_{i\star}x^*_\epsilon - b_i|^p$, it follows that
$\sum_{i=1}^n E[X_i] = \|Ax^*_\epsilon - b\|^p$. We will use Equation (9) of Theorem 13 to provide an upper bound for the
event that $\|T(Ax^*_\epsilon - b)\|^p \le \|Ax^*_\epsilon - b\|^p - \gamma$, where $\gamma = \|Ax^*_\epsilon - b\|^p - (1+3\epsilon)^p Z^p$, under the assumption
that $\|Ax^*_\epsilon - b\| > (1+6\epsilon)Z$.

From the definition of $q_i$ in Equation (7), it follows that for some of the rows, $q_i$ may equal 1 (just as in
the proof of Theorem 5). Since $X_i = E[X_i]$ for these rows, $\sum_i (X_i - E[X_i]) = \sum_{i: q_i < 1} (X_i - E[X_i])$,
and thus we will bound this latter quantity with Equation (9). To do so, we must first provide a bound for
$\sum_{i: q_i < 1} E[X_i^2]$. To that end, note that:

$$\begin{aligned}
|A_{i\star}(x^*_\epsilon - \hat{x}_c)| &\le \|U_{i\star}\|_p \|\tau(x^*_\epsilon - \hat{x}_c)\|_q && \text{(by H\"older's inequality)} \\
&\le \|U_{i\star}\|_p \, \beta \, \|U\tau(x^*_\epsilon - \hat{x}_c)\|_p && \text{(by Definition 3 and Theorem 4)} \\
&\le \|U_{i\star}\|_p \, \beta \, 12Z , && (19)
\end{aligned}$$

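where the final inequality follows from the radius of the high-dimensional ball in which the $\epsilon$-net resides.
From this, we can show that

$$\begin{aligned}
\frac{|A_{i\star}x^*_\epsilon - b_i|^p}{q_i} &\le \frac{2^{p-1}}{q_i}\left(|A_{i\star}x^*_\epsilon - A_{i\star}\hat{x}_c|^p + |\hat{\rho}_i|^p\right) && \text{(since } \hat{\rho} = A\hat{x}_c - b\text{)} \\
&\le 2^{p-1}\left(\frac{\|U_{i\star}\|_p^p \, 12^p \beta^p Z^p}{q_i} + \frac{|\hat{\rho}_i|^p}{q_i}\right) && \text{(by Equation (19))} \\
&\le 2^{p-1}\left(\alpha^p 12^p \beta^p Z^p + 8^p Z^p\right) / r_2 && \text{(by Equation (15))} \\
&\le 24^p (\alpha\beta)^p Z^p / r_2 . && (20)
\end{aligned}$$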
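Therefore, we have that

$$\begin{aligned}
\sum_{i: q_i < 1} E[X_i^2] &= \sum_{i: q_i < 1} |A_{i\star}x^*_\epsilon - b_i|^p \, \frac{|A_{i\star}x^*_\epsilon - b_i|^p}{q_i} \\
&\le \frac{24^p (\alpha\beta)^p Z^p}{r_2} \sum_i |A_{i\star}x^*_\epsilon - b_i|^p && \text{(by Equation (20))} \\
&\le 24^p (\alpha\beta)^p \|Ax^*_\epsilon - b\|^{2p} / r_2 . && (21)
\end{aligned}$$

To apply the lower tail bound of Equation (9) of Theorem 13, define $\gamma = \|Ax^*_\epsilon - b\|^p - (1+3\epsilon)^p Z^p$. Thus,
by Equation (21) and by Equation (9) of Theorem 13 it follows that

$$\begin{aligned}
\ln \Pr\left[\|T(Ax^*_\epsilon - b)\|^p \le (1+3\epsilon)^p Z^p\right] &\le \frac{-r_2 \left(\|Ax^*_\epsilon - b\|^p - (1+3\epsilon)^p Z^p\right)^2}{2 \cdot 24^p (\alpha\beta)^p \|Ax^*_\epsilon - b\|^{2p}} \\
&\le \frac{-r_2}{2 \cdot 24^p (\alpha\beta)^p}\left(1 - \frac{(1+3\epsilon)^p Z^p}{(1+6\epsilon)^p Z^p}\right)^2 && \text{(by the premise)} \\
&\le \frac{-r_2 \epsilon^2}{24^p (\alpha\beta)^p} && \text{(since } \epsilon \le 1/3\text{)} .
\end{aligned}$$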

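Since $r_2 \ge 24^p (\alpha\beta)^p (d \ln(36/\epsilon) + \ln(200)) / \epsilon^2$, it follows that $\|T(Ax^*_\epsilon - b)\| \le (1+3\epsilon)Z$, with probability
no greater than $\frac{1}{200}\left(\frac{\epsilon}{36}\right)^d$. Since there are no more than $\left(\frac{36}{\epsilon}\right)^d$ such points in the $\epsilon$-net, the lemma follows
by the union bound. □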
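The union-bound step can be checked numerically. The sketch assumes, as the logarithmic term $d\ln(36/\epsilon) + \ln(200)$ in the choice of $r_2$ suggests, that the per-point failure probability is $\exp(-(d\ln(36/\epsilon) + \ln 200))$ and that the net has at most $(36/\epsilon)^d$ points; the function name and the sample values of `d` and `eps` are illustrative.

```python
import math

def per_point_failure(d, eps):
    """exp(-(d * ln(36 / eps) + ln 200)), which equals (eps / 36) ** d / 200."""
    return math.exp(-(d * math.log(36.0 / eps) + math.log(200.0)))

d, eps = 4, 0.1                   # hypothetical rank and accuracy parameter
net_points = (36.0 / eps) ** d    # assumed upper bound on the number of net points
union_failure = net_points * per_point_failure(d, eps)
```

Under these assumptions the union bound gives a total failure probability of $1/200$, independent of $d$ and $\epsilon$, which is within the $0.01$ allowed by Lemma 11.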

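C.6 Proof of Lemma 12

Proof. We will prove the contrapositive: If $\|A\hat{x}_{\mathrm{OPT}} - b\| > (1+7\epsilon)Z$ then $\|T(A\hat{x}_{\mathrm{OPT}} - b)\| > (1+\epsilon)Z$.
Since $A\hat{x}_{\mathrm{OPT}}$ lies in the ball $B$ defined by Equation (8) and since the $\epsilon$-net is constructed in this ball, there
exists a point $y_\epsilon = Ax_\epsilon$, call it $Ax^*_\epsilon$, such that $\|A\hat{x}_{\mathrm{OPT}} - Ax^*_\epsilon\| \le \epsilon Z$. Thus,

$$\begin{aligned}
\|Ax^*_\epsilon - b\| &\ge \|A\hat{x}_{\mathrm{OPT}} - b\| - \|Ax^*_\epsilon - A\hat{x}_{\mathrm{OPT}}\| && \text{(by the triangle inequality)} \\
&> (1+7\epsilon)Z - \epsilon Z && \text{(by assumption and the definition of } Ax^*_\epsilon\text{)} \\
&= (1+6\epsilon)Z .
\end{aligned}$$

Next, since Lemma 11 holds for all points $Ax_\epsilon$ in the $\epsilon$-net, it follows that

$$\|T(Ax^*_\epsilon - b)\| > (1+3\epsilon)Z . \qquad (22)$$

Finally, note that

$$\begin{aligned}
\|T(A\hat{x}_{\mathrm{OPT}} - b)\| &\ge \|T(Ax^*_\epsilon - b)\| - \|TA(x^*_\epsilon - \hat{x}_{\mathrm{OPT}})\| && \text{(by the triangle inequality)} \\
&> (1+3\epsilon)Z - (1+\epsilon)\|A(x^*_\epsilon - \hat{x}_{\mathrm{OPT}})\| && \text{(by Equation (22) and Theorem 5)} \\
&\ge (1+3\epsilon)Z - (1+\epsilon)\epsilon Z && \text{(by the definition of } Ax^*_\epsilon\text{)} \\
&> (1+\epsilon)Z && \text{(since } \epsilon < 1\text{)} ,
\end{aligned}$$

which establishes the lemma. □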
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